-
-
Notifications
You must be signed in to change notification settings - Fork 9.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add cchardet support and use chardet as fallback #4115
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of duplicated logic in your change request. Can any of it be de-duplicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we need to try to deduplicate this logic a lot. Do you mind doing that?
sure :-) |
requests/packages.py
Outdated
if isinstance(_package, tuple): | ||
package = _package[1] | ||
try: | ||
locals()[package] = __import__(package[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like this line is correct, should be __import__(_package[0])
? Might want to unpack the tuple right away and then use proper variable names so the logic is easier to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an issue here with the warning that we're raising.
requests/__init__.py
Outdated
assert minor >= 1 | ||
assert patch >= 0 | ||
except AssertionError: | ||
warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0! Falling back to chardet') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't specifying the upper pin, which is < 3.0.0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @2mf, thanks for putting this together! While I'm not opposed to supporting cchardet
, I do have some questions/comments on the current implementation.
tox.ini
Outdated
|
||
[testenv] | ||
|
||
deps = | ||
py{26,27,33,34,35,36}-chardet: chardet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super fond of doubling the number of test instances we're running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure, how I can test against ccharset and charset using existing cases
requests/__init__.py
Outdated
@@ -56,18 +56,37 @@ | |||
except AssertionError: | |||
raise RuntimeError('Requests dependency \'urllib3\' must be version >= 1.21.1, < 1.22!') | |||
|
|||
import warnings | |||
|
|||
def get_version(package): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to abstract this out to a function, maybe we add it to the urllib3 import to reduce the duplicated code to a single place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure I'll do that
requests/__init__.py
Outdated
warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0! Falling back to chardet') | ||
raise | ||
|
||
except (ImportError, SyntaxError, AssertionError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the scenario that raises a SyntaxError
? Also if we're not using e
for anything, we probably don't need to set it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied that from other places in requests, but from my view point it is not needed
locals()[package] = __import__(package) | ||
for package in ('urllib3', 'idna', ('chardet', 'cchardet')): | ||
if isinstance(package, tuple): | ||
package, alt_package = package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to be extremely careful about touching this code, at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This looks safe, but it'd be so much better if this were testable.
This whole thing scares me. |
I for one am not wedded to doing it: I'm nervous about the amount of code it adds as well. |
@2mf do you have any metrics on how much faster cchardet is than chardet? e.g. is it worth the trouble? |
@kennethreitz a benchmark can be seen at https://pypi.python.org/pypi/cchardet
|
@kennethreitz @Lukasa shouldn't be the version matching in the requirements of setup.py?
and
|
That is significantly faster! |
remove unnecessary SyntaxError pin cchardet to >=2.1.0 <2.2.0
This reverts commit b231590.
No, @2mf it shouldn't. Requests needs very specific versions of dependencies. Some of those dependencies are popular and we need a way to tell the user "You're using a version that will horribly break your version of Requests" |
assert minor < 2 | ||
assert patch >= 0 | ||
except AssertionError: | ||
warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0, < 2.2.0! Falling back to chardet') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a dependency of Requests, it's an optional speed-up. Please revise
# cchardet >= 2.1.0 | ||
try: | ||
assert major == 2 | ||
assert minor >= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not:
assert 1 <= minor < 2
except (ImportError, AssertionError): | ||
# Check chardet for compatibility. | ||
import chardet | ||
major, minor, patch = get_version(chardet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chardet
shouldn't be imported in this except block. This will cause extremely confusing tracebacks for users if they can't import either.
@@ -93,6 +93,7 @@ def run_tests(self): | |||
cmdclass={'test': PyTest}, | |||
tests_require=test_requirements, | |||
extras_require={ | |||
'cchardet:(python_version == "2.7" or python_version >= "3.4")': ['cchardet>=2.1.0'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're enforcing (in __init__.py
) that cchardet
be between 2.1.0
and 2.2.0
then that should be written here.
locals()[package] = __import__(package) | ||
for package in ('urllib3', 'idna', ('chardet', 'cchardet')): | ||
if isinstance(package, tuple): | ||
package, alt_package = package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This looks safe, but it'd be so much better if this were testable.
And harder to install and the benchmark, as I understand it, is based off a very old version of chardet and hasn't been updated since. |
It sounds like we're against this change overall. Sorry @2mf. |
I'm on board with doing this, but the current methodology isn't 100% correct and I'm merely pointing out that those benchmarks are older and potentially misleading. @dan-blanchard has done a lot of fantastic work on chardet recently |
I feel the need to point out that chardet is in the midst of gaining support for more codecs and languages than cchardet supports. I'm retraining all the models for greater accuracy too. Speed isn't everything with encoding detection. |
Just for the sake of a recent benchmark
anyway, It is not worth to continue on this one while running against so many walls |
I hadn't actually looked at the benchmark code cChardet uses before, but it just runs the same small SJIS text file through both modules 100 times. I know cChardet will pretty much always be faster than chardet—in fact, if you look at old chardet issues, you'll even see that I once toyed with the idea of trying to merge the projects—but that is not an accurate benchmark, especially for something like requests where a good amount of the time is spent downloading the page content anyway. A requests-based benchmark would make sense for deciding if it makes a difference here. |
I created this pull-request for #4112